
In this section, we use the diabetes dataset to illustrate how to build and interpret various models with the assistance of AI. We begin with the familiar linear regression model, which is widely recognized and well understood. Next, we turn to the decision tree model, which, while new to some, aligns with our intuition; decision trees find practical application in areas such as loan approval in banks and patient diagnosis. Finally, we explore a more intricate mathematical model, highlighting the potential of AI in addressing unfamiliar modeling scenarios.

Prompt:

Show a few rows of the diabetes dataset in scikit-learn as a table.

from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset
diabetes = load_diabetes()

# Create a pandas DataFrame from the dataset
data = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
target = pd.DataFrame(data=diabetes.target, columns=['target'])
df = pd.concat([data, target], axis=1)

# Display the first few rows as a table
display(df.head())
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0
df.shape[0]*0.8
353.6
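The cell above shows that an exact 80/20 split is impossible: 442 × 0.8 = 353.6 is not an integer, so train_test_split has to round. A minimal sketch of the resulting split sizes (assuming scikit-learn's default behavior, which rounds the test fraction up):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the dataset and perform the same 80/20 split used later in
# this section.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 442 * 0.2 = 88.4 is rounded up to 89 test rows, leaving 353 for
# training.
print(len(X_train), len(X_test))  # 353 89
```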

Linear Regression

Prompt:

Use the diabetes data from scikit-learn. Preprocess the data as follows:
1) Load the data and assign it to X, and the target to y. The target variable represents disease progression.
2) Rename the columns to make them more understandable, as shown here:

{'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}

In this context, BP stands for blood pressure, and BMI refers to body mass index. The other six columns correspond to blood sample measurements.

Perform the following steps to analyze the data:
1) Plot the correlation matrix of features as a heatmap.
2) Split the data into training and test sets, allocating 80% of the data for training and 20% for testing.
3) Utilize a linear regression model to predict the target variable.
4) Evaluate the model's performance using the mean absolute error (MAE) metric.
5) Compare the predictions of the model with a baseline model that predicts the median diabetes progression (y) across all patients. Calculate the MAE of this baseline model. Print out both MAEs.

Show python code.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import numpy as np

# Load the diabetes dataset
diabetes = load_diabetes()

# Assign features (X) and target (y)
X = diabetes.data
y = diabetes.target

# Rename the columns
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
                's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
                's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
df = pd.DataFrame(X, columns=[column_names[col] for col in diabetes.feature_names])
df['Disease Progression'] = y

# Show a few rows of the diabetes data (features and target in a table)
print("Diabetes Data (Features and Target):")
display(df.head())

# Plot the correlation matrix as a heatmap
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Diabetes Data - Correlation Matrix")
plt.show()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Utilize linear regression to predict the target variable
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model's performance using MAE
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)

# Calculate the MAE for the baseline model
baseline_pred = [np.median(y)] * len(y_test)
baseline_mae = mean_absolute_error(y_test, baseline_pred)

# Print MAEs
print(f"MAE of the linear regression model: {mae:.2f}")
print(f"MAE of the baseline model: {baseline_mae:.2f}")
Diabetes Data (Features and Target):
Age Sex BMI Blood Pressure Total Cholesterol LDL HDL Thyroid Glaucoma Glucose Disease Progression
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0
MAE of the linear regression model: 42.79
MAE of the baseline model: 62.59

This is a good improvement over the baseline, but the MAE alone tells us nothing about the model's decision-making process, nor does it highlight the key input features that contribute the most to the prediction.

Simply explaining the algorithm does not provide this understanding. In general, the ability to describe and explain a machine learning algorithm does not guarantee that the fitted model is interpretable.

In the case of linear regression, the final prediction is obtained by taking a weighted sum of the input features. By examining the learned weights, we can determine the impact, or importance, of each feature in predicting the target variable: the larger the absolute value of a weight, the greater the impact of the corresponding feature.
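This "weighted sum" view can be verified directly: a fitted model's prediction for one patient should equal the intercept plus the dot product of the learned coefficients with that patient's features. A small sketch (fitting on the full dataset purely for illustration):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# Recompute the prediction for the first patient by hand:
# intercept + weighted sum of the features.
x0 = X[0]
manual = model.intercept_ + np.dot(model.coef_, x0)

# The manual weighted sum matches model.predict exactly (up to
# floating-point tolerance).
print(bool(np.isclose(manual, model.predict(X[:1])[0])))  # True
```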

Prompt:

Analyze the impact of each feature in a linear regression model in the following steps:
1) Retrieve the weights assigned to each feature.
2) Create a plot of the weights, where the feature names are displayed on the y-axis with different colors. The feature names should be arranged in increasing order of the absolute values of their corresponding weights.

Show python code.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Step 1: Load the data and assign it to X and y
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

# Step 2: Rename the columns
column_names = {
    'age': 'Age',
    'sex': 'Sex',
    'bmi': 'BMI',
    'bp': 'Blood Pressure',
    's1': 'Total Cholesterol',
    's2': 'LDL',
    's3': 'HDL',
    's4': 'Thyroid',
    's5': 'Glaucoma',
    's6': 'Glucose'
}
X.rename(columns=column_names, inplace=True)

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Step 1: Retrieve the weights assigned to each feature
feature_weights = model.coef_

# Step 2: Create a plot of the weights, arranged by the absolute values of weights
sorted_indices = np.argsort(np.abs(feature_weights))
sorted_feature_names = X.columns[sorted_indices]
sorted_weights = feature_weights[sorted_indices]

plt.figure(figsize=(8, 6))
plt.barh(range(len(sorted_feature_names)), sorted_weights, color=['r' if w < 0 else 'g' for w in sorted_weights])
plt.yticks(range(len(sorted_feature_names)), sorted_feature_names)
plt.xlabel('Weight')
plt.ylabel('Feature')
plt.title('Impact of Features in Linear Regression')
plt.show()
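As a companion to the chart, the same weights can be printed as a table sorted by absolute value, which is sometimes easier to read off than a plot. A sketch (refitting the model on the renamed features as above):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

model = LinearRegression().fit(X, y)

# Pair each feature name with its learned weight, then sort by
# absolute weight in decreasing order of impact.
weights = pd.Series(model.coef_, index=X.columns)
print(weights.reindex(weights.abs().sort_values(ascending=False).index))
```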

Try to interpret the model.

Decision Tree

A decision tree is a machine learning algorithm that can model complex nonlinear relationships. It typically has higher predictive power than linear regression while remaining highly interpretable.

Prompt:

Use the diabetes data from scikit-learn. Preprocess the data as follows:
1) Load the data and assign it to X, and the target to y. The target variable represents disease progression.
2) Rename the columns to make them more understandable, as shown here:

{'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}

In this context, BP stands for blood pressure, and BMI refers to body mass index. The other six columns correspond to blood sample measurements.

Perform the following steps to analyze the data:
1) Split the data into training and test sets, allocating 80% of the data for training and 20% for testing.
2) Utilize a decision tree model to predict the target variable.
3) Evaluate the model's performance using the mean absolute error (MAE) metric.
4) Visualize the decision tree.
5) Plot the feature importance for the diabetes dataset using the decision tree model. Display the feature names on the y-axis with different colors. Arrange the feature names in increasing order of importance.

Show python code.

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
from sklearn import tree

# Load the diabetes dataset
diabetes = load_diabetes()

# Assign features (X) and target variable (y)
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

# Rename the columns
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
                's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
                's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
X.rename(columns=column_names, inplace=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree model
model = DecisionTreeRegressor(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance using mean absolute error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

# Visualize the decision tree
plt.figure(figsize=(12, 8))
tree.plot_tree(model, feature_names=X.columns, filled=True)
plt.show()

# Plot the feature importance
feature_importance = model.feature_importances_
sorted_indices = np.argsort(feature_importance)
sorted_features = X.columns[sorted_indices]

plt.figure(figsize=(10, 6))
plt.barh(sorted_features, feature_importance[sorted_indices])
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance for Diabetes Dataset')
plt.show()

### ChatGPT may ignore the requirement to use different colors
### User code to generate a colored chart
# feature_importance = model.feature_importances_
# sorted_indices = np.argsort(feature_importance)
# sorted_features = X.columns[sorted_indices]
# import matplotlib
# sorted_importances = feature_importance[sorted_indices]
# cmap = matplotlib.colormaps['viridis']
# colors = cmap(sorted_importances)
# plt.figure(figsize=(10, 6))
# plt.barh(sorted_features, feature_importance[sorted_indices], color=colors)
# plt.xlabel('Feature Importance')
# plt.ylabel('Features')
# plt.title('Feature Importance for Diabetes Dataset')
# plt.show()
Mean Absolute Error (MAE): 48.0965923002249

Tasks

  1. Set max_depth=3 as follows

    model = DecisionTreeRegressor(max_depth=3, random_state=42)
    

     to understand the prediction of a decision tree, which consists of a series of if-else conditions.

  2. Set max_depth=None to understand the feature importance.

  3. Compare the MAEs of the two settings above to understand overfitting, a problem that decision tree models are prone to. The value of samples shown in a node represents the number of training data points that reach that particular node of the tree; in our context, it is the number of patients whose data determines the prediction at that node.
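Task 3 can be sketched as follows: fit both a shallow tree and a fully grown tree on the same split and compare their training and test MAEs. The fully grown tree drives the training error toward zero while the test error stays high, which is the signature of overfitting (this assumes the same random_state=42 split used in the code above):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Compare a depth-limited tree with an unlimited one.
maes = {}
for depth in (3, None):
    tree_model = DecisionTreeRegressor(
        max_depth=depth, random_state=42).fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, tree_model.predict(X_train))
    test_mae = mean_absolute_error(y_test, tree_model.predict(X_test))
    maes[depth] = (train_mae, test_mae)
    print(f"max_depth={depth}: train MAE={train_mae:.1f}, "
          f"test MAE={test_mae:.1f}")
```

The unlimited tree memorizes the training set (near-zero training MAE) yet does not generalize better, while the depth-3 tree trades some training accuracy for a simpler, more honest model.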

Generalized Additive Models (GAMs)

Prompt:

Use the diabetes data from scikit-learn. Preprocess the data as follows:
1) Load the data and assign it to X, and the target to y. The target variable represents disease progression.
2) Rename the columns to make them more understandable, as shown here:

{'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}

In this context, BP stands for blood pressure, and BMI refers to body mass index. The other six columns correspond to blood sample measurements.

Perform the following steps to analyze the data:
1) Split the data into training and test sets, allocating 80% of the data for training and 20% for testing.
2) Utilize a generalized additive model (GAM) to predict the target variable.
3) Evaluate the model's performance using the mean absolute error (MAE) metric.
4) Plot the effect of each of the patient features on the target variable with 95% confidence intervals.

Show python code.

I had a hard time getting ChatGPT to produce executable code without errors. You can copy/paste the error message back to ChatGPT and ask it to correct the code, or ignore Step 4) at first to simplify the task. Here are a few notes on the code ChatGPT provided. In building the GAM model, we use

gam = LinearGAM(s(0) + f(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9), n_splines=35)

where f(1) (a factor term) is used instead of s(1) (a spline term) because the feature Sex is categorical.

When training the model, instead of using

gam.fit(X_train, y_train)

we can use

gam.gridsearch(X_train, y_train)

to perform training and determine optimal parameter values.

# you may need to run this cell twice
!pip install pygam
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from pygam import LinearGAM, s, f, l
from sklearn.metrics import mean_absolute_error

# Step 1: Load the data and assign it to X, and the target to y
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Step 2: Rename the columns
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
                's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL', 's4': 'Thyroid',
                's5': 'Glaucoma', 's6': 'Glucose'}
feature_names = [column_names[col] for col in diabetes.feature_names]

# Step 3: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Utilize a generalized additive model (GAM) with grid search for hyperparameter tuning
gam = LinearGAM(s(0) + f(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9), n_splines=35)

# Step 5: Fit the model on the training data
# gam.gridsearch(X_train, y_train)
gam.fit(X_train, y_train)

# Step 6: Evaluate the model's performance using mean absolute error (MAE) metric
y_pred = gam.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.2f}")

# Step 7: Plot the effect of each patient feature on the target variable with 95% confidence intervals
fig, axs = plt.subplots(2, 5, figsize=(15, 6))
axs = axs.ravel()

for i in range(X.shape[1]):
    XX = gam.generate_X_grid(term=i)
    partial_dependence, confidence_intervals = gam.partial_dependence(term=i, X=XX, width=0.95)

    axs[i].plot(XX[:, i], partial_dependence, color='blue')
    axs[i].fill_between(XX[:, i], confidence_intervals[:, 0], confidence_intervals[:, 1], color='lightblue', alpha=0.7)
    axs[i].set_xlabel(feature_names[i])
    axs[i].set_ylabel('Target Variable')
    axs[i].set_title(f'Effect of {feature_names[i]}')

plt.tight_layout()
plt.show()
Mean Absolute Error (MAE): 41.38
# # Here is another block of working code generated by ChatGPT

# import numpy as np
# import pandas as pd
# from sklearn.datasets import load_diabetes
# from sklearn.model_selection import train_test_split
# from pygam import LinearGAM, s, f
# from sklearn.metrics import mean_absolute_error
# import matplotlib.pyplot as plt

# # Load the diabetes dataset
# diabetes = load_diabetes()

# # Assign features (X) and target variable (y)
# X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
# y = diabetes.target

# # Rename the columns
# column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
#                 's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
#                 's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
# X.rename(columns=column_names, inplace=True)

# # Split the data into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # Create a generalized additive model (GAM)
# model = LinearGAM(s(0) + f(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9), n_splines=35).fit(X_train, y_train)

# # Make predictions on the test set
# y_pred = model.predict(X_test)

# # Evaluate the model's performance using mean absolute error (MAE)
# mae = mean_absolute_error(y_test, y_pred)
# print(f"Mean Absolute Error (MAE): {mae}")

# # Plot the effect of each patient feature on the target variable with 95% confidence intervals
# fig, axs = plt.subplots(2, X.shape[1] // 2, figsize=(15, 6))

# for i, feature in enumerate(X.columns):
#     row = i // (X.shape[1] // 2)
#     col = i % (X.shape[1] // 2)
#     XX = model.generate_X_grid(term=i)
#     axs[row, col].plot(XX[:, i], model.partial_dependence(term=i, X=XX))
#     axs[row, col].plot(XX[:, i], model.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
#     axs[row, col].set_title(feature)

# plt.tight_layout()
# plt.show()

Tasks

  1. Explain the features BMI and Blood Pressure (BP). Why is there a wider confidence interval around the lower and higher ends of BMI and BP?

  2. What can you say about the impact of higher LDL values (or bad cholesterol) on the target variable?

from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset
diabetes = load_diabetes()

# Create a pandas DataFrame from the dataset
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
                's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
                's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
df = pd.DataFrame(diabetes.data, columns=[column_names[col] for col in diabetes.feature_names])
df['Disease Progression'] = diabetes.target

print(df[(df['Total Cholesterol'] > 0.15) & (df['LDL'] > 0.19)])
          Age      Sex       BMI  Blood Pressure  Total Cholesterol       LDL  \
123  0.005383  0.05068  0.034751       -0.001091           0.152538  0.198788   

          HDL   Thyroid  Glaucoma  Glucose  Disease Progression  
123 -0.061809  0.185234  0.015568  0.07348                 84.0